Abstract

In a landscape composed of N randomly distributed sites in Euclidean space, a 
walker ("tourist") goes to the nearest one that has not been visited in the last 
r steps. This procedure leads to trajectories composed of a transient part and a
final cyclic attractor of period p. The tourist walk presents a simple scaling with 
respect to r and can be performed in a wide range of networks that can be viewed 
as ordinal neighborhood graphs. As an example, we show that graphs defined by 
thesaurus dictionaries share some of the statistical properties of low dimensional 
(d = 2) Euclidean graphs and are easily distinguished from random link networks,
which correspond to the $d \to \infty$ limit. This approach furnishes complementary
information to the usual clustering coefficient and mean minimum separation length. 



1 Introduction 

We live in a world formed by networks: biological, social, linguistic and
technological. The study and characterization of such networks have been boosted
recently by the emergence of new ideas, growing network databases and the
computational power available to test models and link them to data [1].

Networks form the substrate on which dynamical processes can occur, such as the
spreading of diseases or the diffusion of information. Such processes are usually
studied by using stochastic dynamics and random walks. Navigation processes
and exploratory behavior [2,3] can also be modeled by walks inside graphs.
Of course, navigation processes are neither purely random nor purely
deterministic. However, as a first step, it would be interesting to know the generic
properties of deterministic navigation inside networks [4-6]. If these networks
are disordered, deterministic walks will lead to trajectory statistics
which could provide information about the network topology and the kind of
disorder. Consider the following examples of deterministic walks in the context
of translation machines and thesaurus graphs.

Automatic translation software presents various problems due to the difficult 
task, even to humans, of converting phrases from one language to another. For 
example, it seems a desirable feature that if a translated sentence is translated 
anew to the original language one should get the original phrase. However, the 
iteration of the translation process in standard translator software frequently 
produces a drift from the original sense. After a transient, this drift achieves 
a two-cycle that constitutes a pair of usually nonsensical sentences, which are 
the same when translated back and forth. 

At the word level, a similar phenomenon occurs in analogical dictionaries or
thesauri. Starting from a random word, if one iterates the process of going to
the nearest synonym in some sense (say, for example, the first word in the
given list), one also reaches cycles of period two, as shown in Table 1.

1) link → connection → link

2) translation → conversion → change → alter → change

3) constitution → charter → contract → agreement → accord → agreement

4) constitution → establishment → organization → association → friendship →
companionship → company → corporation → business → commerce → trade →
deal → contract → agreement → accord → agreement

Table 1

Examples of two-cycles obtained by iterating to the first word in the synonym list
(Microsoft Word 98): 1) zero-transient trajectory; 2) trajectory with two transient
steps; notice that trajectories may differ between the 3) United Kingdom English
and 4) USA English thesauri.

This iterative procedure converges, after some (sometimes large) transient
time, to a two-cycle. Two-cycles are the universal attractors of this iteration
process. This can be easily understood if one thinks about words in a thesaurus
as nodes of a graph. This graph is not purely random: its nodes are
semantically or etymologically linked, defining some ordinal relationship in the
node neighborhood (first neighbor, second neighbor etc.). Two-cycles appear
when two mutually nearest neighbors are finally found.
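
The iteration just described can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `first_synonym_walk` and the dictionary `TOY_THESAURUS` are hypothetical, and the toy map merely transcribes trajectories 1 and 2 of Table 1 rather than a real thesaurus.

```python
def first_synonym_walk(first_synonym, start):
    """Follow the first-listed synonym until a word repeats.

    Returns (transient, cycle): the words visited before the cycle is
    entered, and the words that form the cycle.
    """
    path = []
    seen_at = {}  # word -> position of its first visit in `path`
    word = start
    while word not in seen_at:
        seen_at[word] = len(path)
        path.append(word)
        word = first_synonym[word]
    entry = seen_at[word]  # position where the cycle is entered
    return path[:entry], path[entry:]


# Toy first-neighbor map transcribed from trajectories 1 and 2 of Table 1.
TOY_THESAURUS = {
    "link": "connection",
    "connection": "link",
    "translation": "conversion",
    "conversion": "change",
    "change": "alter",
    "alter": "change",
}

transient, cycle = first_synonym_walk(TOY_THESAURUS, "translation")
print(transient, cycle)
```

Starting from "translation", the sketch yields two transient words followed by the two-cycle formed by "change" and "alter", in agreement with trajectory 2 of Table 1.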

An interesting statistical question concerns the average degree of separation
between two words in a thesaurus. After some experimentation with standard
electronic thesauri, one finds that the average degree of separation is of
order log N, where N is the number of nodes (words) in the graph defined by a
thesaurus. One also observes that synonyms have a large clustering coefficient.
This means that the probability that two synonyms of a word are also
synonyms of each other is large. This suggests that a thesaurus is an instance
of the so-called Small World networks [7]. This fact has been confirmed recently
through exhaustive quantitative measurements in the Merriam-Webster dictionary
[8], Roget's thesaurus [9], the WordNet database [9,10] and even a free
word association database [9].

Here we suggest a different point of view to probe the structure of these highly
non-trivial networks, through the study of the statistics of a simple
deterministic walk on them: the "tourist walk" [6]. Starting from an initial site,
a walker (tourist) moves at each time step according to the simple deterministic rule:
go to the nearest site that has not been visited in the preceding r time steps.
The tourist performs a deterministic partially self-avoiding walk with a
memory window of r steps. This iterative dynamics always leads to a trajectory
that is composed of an initial transient part and a final attractor (a cycle of
period $p \geq r + 2$) that traps the tourist. Here, r is the limited tourist memory
range or, alternatively, it can be thought of as a refractory time of the sites.
For r = 0, only cycles with period 2 appear, which correspond to pairs of
mutually nearest neighbor points, i.e., the two-cycle phenomenon
described earlier.
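
A minimal sketch of this rule in Python may help to fix ideas. It is not the code used for the simulations reported here: the names `tourist_walk` and `TOY_TABLE` are hypothetical, the table is assumed to list, for every site, at least r + 1 neighbors ordered from nearest to farthest, and the toy ranking corresponds to four made-up collinear points at positions 0, 1, 1.5 and 4. The attractor is detected when the memory state (the last r + 1 visited sites) repeats; counting the transient as the number of steps before the site sequence becomes periodic is a convention chosen for this sketch.

```python
def tourist_walk(table, start, r):
    """Deterministic walk with memory r on an ordinal neighbor table.

    table[i] lists the neighbors of site i, nearest first (at least r + 1).
    Returns (transient_length, cycle_sites).
    """
    path = [start]
    first_seen = {}  # memory state -> time step of its first occurrence
    while True:
        t = len(path) - 1
        state = tuple(path[-(r + 1):])  # current site plus the r previous ones
        if state in first_seen:
            t0 = first_seen[state]
            period = t - t0
            # Move the entry point back while the site sequence is already periodic.
            while t0 > 0 and path[t0 - 1] == path[t0 - 1 + period]:
                t0 -= 1
            return t0, path[t0:t0 + period]
        first_seen[state] = t
        forbidden = set(path[-(r + 1):-1])  # the r sites visited before the current one
        path.append(next(j for j in table[path[-1]] if j not in forbidden))


# Hypothetical ranking for four collinear points at 0, 1, 1.5 and 4.
TOY_TABLE = [[1, 2], [2, 0], [1, 0], [2, 1]]

print(tourist_walk(TOY_TABLE, 0, 0))  # r = 0: ends in a two-cycle
print(tourist_walk(TOY_TABLE, 3, 1))  # r = 1: ends in a three-cycle
```

In this toy example the r = 0 walk is trapped by the reflexive pair (1, 2), while the r = 1 walk ends in a cycle of the minimal period r + 2 = 3.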

The main result presented in this paper is that deterministic walks in
spatial graphs (graphs whose nodes lie in d-dimensional Euclidean space) present
some interesting statistical regularities. These regularities could serve as a
benchmark when studying real world networks. In Section 2 we show that the
tail of the distribution $P_r(p)$ of cycle periods p, i.e., the number of different
p-cycles divided by the total number of cycles, can be described in the whole
r-range by a limited power law. In this way all curves $P_r(p)$ can be collapsed
into a single universal curve independent of r. This extends and
improves the description given in Ref. [6]. Also, for r = 0, analytical
support, given by Cox's formula [11-14], is presented along the text. In
addition to the quantities studied in Ref. [6], another quantity is examined: the
probability $P_a(p)$ that a random site belongs to the basin of a p-cycle attractor.
This probability is interesting since it shares the scaling properties of $P_r(p)$
and is easily measurable from experimental data. These quantities have been studied
as a function of the dimensionality d of the system, where we show a slow
convergence to the random link network. The random link network corresponds to
the limit $d \to \infty$ and, for r = 1, its cycle distribution behaves as
$P_r(p) \propto p^{-1}$.
In Section 3, we compare our results, obtained in low dimensional Euclidean
random graphs, to those obtained in the Small World graphs of thesauri and
find surprising similarities between them and huge differences from the
$d \to \infty$ limit. This suggests that the short range connectivity structure of thesaurus
graphs could be embedded in a low dimensional space, with the Small World
behavior given by a small fraction of long range connections. This contrasts with
standard models which represent words in high dimensional Euclidean space
[9,15]. Finally, in Section 4, concluding remarks are presented.



2 Deterministic Walks 

The construction of the d-dimensional Euclidean random graphs starts by
randomly distributing the coordinates of j = 1, 2, ..., N sites with a uniform
density $\rho$ in the interval [0, 1] in d dimensions. The Euclidean matrix of
distances $D_{ij}$ is used for ordering the neighbors and to create a neighborhood table
$V_{ik} = j(k)$, where $j(k)$ is the site that is the $k$th nearest neighbor of site $i$.

Notice that a random collection of points in Euclidean space does not define a
graph because a set of links is also necessary. If all the points are connected by
links, we would have a totally connected graph. We will represent this graph
by $G_{N-1}$, where $N - 1$ is the number of outgoing links of each node. But with a
memory window r, the walker dynamics is well defined if one knows the r + 1
nearest neighbors of each site. Moves to the other neighbors never occur. This means
that the walk is done inside a subgraph $G_{r+1}$ formed by linking directionally each
site i to its r + 1 nearest neighbors. The tourist walk is always performed
inside this directed subgraph that only presents an ordered set of neighbors
$V_{ik}$ (k = 1, ..., r + 1) for each node. For example, in the $G_1$ subgraph each
node is connected only to its nearest neighbor: indeed this graph is composed
of several disconnected parts, one for each pair of mutual nearest neighbors. In
the $G_2$ subgraph, each node is linked to its two nearest neighbors. Such a graph
has a giant component (percolating cluster) containing more than 98% of the
nodes [16]. Notice that $G_n$ is always a subgraph of $G_{n+1}$.

We stress that the Euclidean metric has been introduced in our model only
as a simple way of ranking the neighbors of each site. As pointed out in Ref. [6],
the dynamics is performed on the neighborhood table $V_{ik}$, not on the distance
matrix $D_{ij}$.
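
The construction of the neighborhood table can be sketched as follows. This is an illustration under stated assumptions, not the code behind the reported simulations: the function name `neighbor_table`, its `periodic` flag (wrap-around distances in the unit box) and the brute-force sorting are choices made here for clarity, and ties between distances are not treated specially.

```python
import random


def neighbor_table(points, k, periodic=False):
    """Return, for every site, the indices of its k nearest neighbors, nearest first."""

    def dist2(a, b):
        total = 0.0
        for x, y in zip(a, b):
            delta = abs(x - y)
            if periodic:  # unit box with wrap-around
                delta = min(delta, 1.0 - delta)
            total += delta * delta
        return total

    n = len(points)
    table = []
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist2(points[i], points[j]))
        table.append(ranked[:k])
    return table


rng = random.Random(0)  # fixed seed, only to make the sketch reproducible
points = [(rng.random(), rng.random()) for _ in range(200)]  # d = 2, N = 200
g1 = neighbor_table(points, 1)  # table of the G_1 subgraph (enough for r = 0)
g2 = neighbor_table(points, 2)  # table of the G_2 subgraph (enough for r = 1)
```

Only the ranks stored in the table enter the dynamics, so any other ordinal ranking (for instance, a ranked synonym list) could replace the Euclidean distances used in this sketch.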



2.1 Walks without memory 

The density of cycles $D_r(p; d)$ is the number of different p-cycles divided by
N for a given memory r and dimension d. As observed before, in the
no-memory situation r = 0, only 2-cycles appear. Nearest neighbor statistics in
Poisson processes on Euclidean spaces are well known and have been used
more intensively after the work of Clark and Evans [11], followed by Refs.
[12-14]. They defined "reflexive nearest neighbors" as two sites that are
the nearest neighbors of each other. These reflexive neighbors are precisely the
attractors in the r = 0 tourist walk.

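Applying Cox's formula [14], the density $D_0(2; d)$ of 2-cycles in dimension d is
given by:

$$D_0(2; d) = \frac{P_{rn}(d)}{2} = \frac{1}{2(1 + p_d)} , \qquad (1)$$

$$p_d = \frac{2}{\sqrt{\pi}}\, \frac{\Gamma(d/2 + 1)}{\Gamma[(d+1)/2]} \int_0^{1/2} dx\, (1 - x^2)^{(d-1)/2} , \qquad (2)$$

with $P_{rn}(d)$ being the fraction of reflexive neighbors and $\Gamma(z)$ the gamma
function.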

These formulas can be worked out in terms of standard functions. Observing
that:

$$\frac{\Gamma(d/2 + 1)}{\sqrt{\pi}\, \Gamma[(d+1)/2]} = \frac{1}{B[1/2, (d+1)/2]} , \qquad (3)$$

where $B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a + b)$ is the beta function, one obtains:

$$p_d = I_{1/4}\left[\frac{1}{2}, \frac{d+1}{2}\right] , \qquad (4)$$

with $I_z(a, b)$ being the (regularized) incomplete beta function. Some special values
of these quantities are shown in Table 2.



d          $p_d$                          $D_0(2; d)$

1          1/2                            1/3
2          $(2\pi + 3\sqrt{3})/(6\pi)$    $3\pi/(8\pi + 3\sqrt{3})$
3          11/16                          8/27
$\infty$   1                              1/4

Table 2

Some values of $p_d$ and $D_0(2; d)$ as a function of d.
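
The tabulated values can be checked numerically. The sketch below assumes Cox's formula in the form $P_{rn}(d) = 1/(1 + p_d)$ with $p_d = (2/\sqrt{\pi})\, \Gamma(d/2 + 1)/\Gamma[(d+1)/2] \int_0^{1/2} (1 - x^2)^{(d-1)/2} dx$, and evaluates the integral with a composite Simpson rule using only the Python standard library. The function names and the number of integration steps are choices made for this illustration.

```python
import math


def p_d(d, steps=2000):
    """Evaluate p_d by Simpson integration on [0, 1/2] (steps must be even)."""
    prefactor = (2.0 / math.sqrt(math.pi)
                 * math.gamma(d / 2 + 1) / math.gamma((d + 1) / 2))
    h = 0.5 / steps

    def f(x):
        return (1.0 - x * x) ** ((d - 1) / 2)

    total = f(0.0) + f(0.5)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(i * h)
    return prefactor * total * h / 3.0


def reflexive_fraction(d):
    """Fraction of reflexive nearest neighbors, P_rn(d) = 1 / (1 + p_d)."""
    return 1.0 / (1.0 + p_d(d))


def two_cycle_density(d):
    """Density of 2-cycles for r = 0, D_0(2; d) = P_rn(d) / 2."""
    return 0.5 * reflexive_fraction(d)


for d in (1, 2, 3):
    print(d, p_d(d), reflexive_fraction(d), two_cycle_density(d))
```

For d = 1 and d = 3 the integral is elementary and the sketch should reproduce the rational entries of Table 2 up to rounding error; the incomplete beta form of the result could be evaluated instead with a library routine where one is available.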



2.2 Walks in d = 2



We consider now the dependence on the memory r in the two dimensional
case. For r > 0, cycles of different periods coexist. Examples of cycles appearing
for a landscape with N = 400 sites and several values of the memory r are given
in Fig. 1, where open boundary conditions have been used. As r grows, one
observes the progressive disappearance, development or coalescence of cycles.



A natural question concerns the distribution of cycle periods

$$P_r(p) = \frac{D_r(p; d)}{\sum_{p'} D_r(p'; d)} , \qquad (5)$$

and also whether there is a value of r where a percolation phenomenon occurs, i.e.,
the appearance of a cycle with a diameter comparable to the system size. However,
the detailed study of this percolating regime demands a computational
effort that is far beyond our resources. Here we only report results for the
also interesting regime r = O(log N), which is of the order of the mean
minimum separation length in Small World networks. This regime gives information
about the clustering properties of the network at the scale of O(r) sites.

Numerical simulations, using periodic boundary conditions, have been per- 
formed for d = 2 and r varying from 1 to 10 with N = 10'^ and averaging over 
100 landscapes, see Fig. 2a. We have found that, for $p > 2p_{\min} = 2(r + 2)$, the
probability of appearance of a p-cycle $P_r(p)$ can be fitted by the expression:

$$P_r(p) = C(r)\, p^{-\alpha}\, e^{-[p/p_0(r)]^2} , \qquad (6)$$



The fit gives $\alpha \approx 2.6 \pm 0.1$ independent of r. We observed that
the cutoff length scales as $p_0 \propto r$, suggesting that all these curves could be
scaled into a single universal function which depends only on the ratio p/r.
As the areas under the $P_r(p)$ curves are always equal to one, we have found
that the data collapse on the function:

$$G(p/r) = r P_r(p) , \qquad (7)$$

as illustrated in Fig. 2b. It is worthwhile mentioning that other scalings, such as
$(p - p_{\min})/p_{\min}$ or $p/p_{\min}$ with $p_{\min} = r + 2$, have been tested and do
not work as well as the one proposed above. A similar scaling does not hold for d = 1
since in this case the exponent $\alpha$ depends strongly on r, see Ref. [6].
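
The rescaling $G(p/r) = r P_r(p)$ amounts to a change of variable from p to p/r that preserves the unit area under each curve. A minimal sketch of this bookkeeping is given below; the helper names `period_distribution` and `collapse` are hypothetical and the list of periods is made-up toy data, not simulation output.

```python
from collections import Counter


def period_distribution(periods):
    """Turn a list of periods (one per distinct cycle) into {p: P_r(p)}."""
    counts = Counter(periods)
    total = sum(counts.values())
    return {p: k / total for p, k in counts.items()}


def collapse(distribution, r):
    """Map {p: P_r(p)} to sorted (p / r, r * P_r(p)) pairs."""
    return sorted((p / r, r * prob) for p, prob in distribution.items())


# Made-up periods for r = 2 (all of them satisfy p >= r + 2).
toy_periods = [4, 4, 5, 8]
P = period_distribution(toy_periods)
print(collapse(P, 2))
```

Plotting the pairs returned by `collapse` for several values of r on the same axes is the kind of comparison shown in Fig. 2b.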



2.3 High Dimensional Walks 



For d > 1, the $P_r(p)$ curve slowly converges to the $d \to \infty$ behavior, which
is also the behavior of the random link network (Fig. 3a), which neglects the
correlations among the distances between sites in a d-dimensional Euclidean
space [17-19]. The distribution tail has the form

$$P_r(p) \propto p^{-\alpha(d)}\, \Theta(p, r, N) , \qquad (8)$$

with $\alpha(d)$ varying from $\alpha(2) \approx 2.6$ to $\alpha(\infty) = 1$ and
$\Theta(p, r, N)$ being a cutoff function.



2.3.1 Convergence to the random link behavior 

The random link network has been generated numerically by considering the
distances between two points as uniform random variables in the interval [0, 1].
The distribution $P_r(p)$ for the random link network can be described by a
power law $p^{-1}$ (see Fig. 3a) with a cutoff which grows with the system size N,
since it is a version of a random map model [20]. Notice, however, that we have
studied a symmetrical situation ($D_{ij} = D_{ji}$) with r = 1. The Derrida random
map [21] corresponds to an asymmetrical random link case ($D_{ij} \neq D_{ji}$); it
presents cycle periods p > 2 even for r = 0 and the total number of attractors
is of order log N. For r = 0, the symmetric case presents only two-cycles and is
the correct approximation for the limit $d \to \infty$, with the number of attractors
scaling as 0.25N.
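
The 0.25N scaling of the symmetric random link case with r = 0 is easy to probe numerically. The sketch below is an illustration only: the function name, the system size and the seed are arbitrary choices. One uniform random number is drawn per unordered pair of sites, so that $D_{ij} = D_{ji}$, and the pairs of mutually nearest sites are counted.

```python
import random


def count_two_cycles_random_link(n, seed=0):
    """Count pairs of mutually nearest sites for symmetric random distances."""
    rng = random.Random(seed)
    best = [float("inf")] * n  # smallest distance found so far for each site
    nearest = [-1] * n         # index of the corresponding nearest site
    for i in range(n):
        for j in range(i + 1, n):
            d = rng.random()   # one draw per unordered pair: D_ij = D_ji
            if d < best[i]:
                best[i], nearest[i] = d, j
            if d < best[j]:
                best[j], nearest[j] = d, i
    reflexive_sites = sum(1 for i in range(n) if nearest[nearest[i]] == i)
    return reflexive_sites // 2  # each two-cycle contains two sites


n = 1000
print(count_two_cycles_random_link(n) / n)  # expected to be close to 0.25
```

A site and its nearest neighbor form a two-cycle when their common link is also the shortest of the neighbor's remaining N - 2 links, which happens with probability (N - 1)/(2N - 3), i.e., close to 1/2 for large N.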

The slow convergence to the $d \to \infty$ behavior also appears in other statistical
quantities as, for example, the probability $P_a(p)$ that a random initial site
leads to a p-cycle (Fig. 3b). When studying deterministic walks in real world
networks, the quantity $P_a(p)$ is more convenient from an experimental point
of view. One can obtain better statistics for $P_a(p)$ since each different initial
condition gives valid data and it is not necessary to keep track of only the distinct
cycles, as required by the definition of $P_r(p)$.
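
The difference between the two estimators can be made explicit with a short sketch. It assumes that the attractor reached from each initial condition is available as a list of sites; the function names and the toy input are hypothetical. Every start contributes to $P_a(p)$, whereas $P_r(p)$ requires removing repeated copies of the same cycle first.

```python
from collections import Counter


def canonical(cycle):
    """Rotation-invariant label of a cycle given as a list of sites."""
    return min(tuple(cycle[k:] + cycle[:k]) for k in range(len(cycle)))


def period_distributions(cycles_by_start):
    """Return (P_r, P_a) from one reached cycle per initial condition."""
    basin = Counter(len(c) for c in cycles_by_start)         # every start counts
    distinct = {canonical(c) for c in cycles_by_start}       # each cycle counted once
    spectrum = Counter(len(c) for c in distinct)
    n_starts, n_cycles = len(cycles_by_start), len(distinct)
    P_a = {p: k / n_starts for p, k in basin.items()}
    P_r = {p: k / n_cycles for p, k in spectrum.items()}
    return P_r, P_a


# Toy input: four starts fall into the same two-cycle, one into a three-cycle.
reached = [[1, 2], [2, 1], [1, 2], [5, 6, 7], [1, 2]]
P_r, P_a = period_distributions(reached)
print(P_r, P_a)
```

In the toy input the two-cycle has the larger basin, so it dominates $P_a(p)$ while both cycles weigh equally in $P_r(p)$.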



3 Thesaurus graphs 

To exemplify the use of the tourist walk in ordinal ranking problems, we
have considered r = 0 and r = 1 walks in a graph defined by a thesaurus.
In this situation a word is chosen uniformly at random from a standard
dictionary. The chosen word is then entered into an electronic thesaurus (the USA
English Microsoft Word 98 thesaurus). In the considered thesaurus, the synonyms
are ranked according to their frequency of use.



3.1 No memory



For an r = 0 walk, the most frequently used synonym of a given word (the first
word in the synonym list, or "first neighbor") is chosen as a new input
and, again, its first neighbor is chosen in an iterative process, as illustrated
in Table 1. This procedure leads to a transient and (generally) to a cycle of
period 2, where two reflexive nearest synonyms are found. The distribution of
transient times presents an exponential decay (Fig. 4).

The proportion of reflexive neighbors could be estimated in principle by
measuring the fraction of initial words that are already in a two-cycle (like the
word "link" in Table 1). From 1000 initial words (about 2% of the whole
dictionary), 272 are found to belong to a two-cycle. This value leads to
$P_{rn} = 0.272 \pm 0.028$, with a confidence level $\gamma = 95\%$, which is very far
from the range [1/2, 2/3] expected for random points in d-dimensional Euclidean
space. However, it is clear from the distribution of the transient times (Fig. 4)
that this number has been underestimated. One of the reasons for this
underestimation is that a large fraction of two-cycles
contain two-word expressions, as in "punctual - on time - punctual". Since
we have always started from single-word expressions, the cycle "on time -
punctual - on time" will never be caught as a zero-transient exemplar.
Another source of this discrepancy may be the presence of a small fraction of
random links (see below). Extrapolating the exponential decay of the transient
times to t = 0 leads to a better estimate $P(t = 0) = P_{rn} = 0.618$, which is
compatible with the low dimensional d = 2 (or maybe d = 3) estimate from Cox's formula,
$P_{rn}(2) = 0.62$ ($P_{rn}(3) = 0.59$).
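
The quoted uncertainty of the raw estimate is consistent with the usual normal approximation for a binomial proportion. The sketch below assumes that this approximation, with z = 1.96 for a 95% confidence level, is the one intended; the function name is hypothetical.

```python
import math


def proportion_interval(successes, trials, z=1.96):
    """Point estimate and half-width of a normal-approximation interval."""
    p_hat = successes / trials
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / trials)
    return p_hat, half_width


p_hat, half_width = proportion_interval(272, 1000)
print(p_hat, round(half_width, 3))
```

With 272 zero-transient words out of 1000, the half-width rounds to 0.028, matching the value given in the text.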

We shall also mention that some cycles of higher period have been found in
the dictionary experiment: (5.2 ± 1.4)% of three-cycles and (0.5 ± 0.4)% of
four-cycles, with confidence level $\gamma = 95\%$. Thus, the examined dictionary
may present links where B is the nearest synonym of A, C is the nearest
synonym of B and A is the nearest synonym of C. This kind of
relation cannot be embedded in a metric space (due to the triangle inequality),
suggesting that this thesaurus graph is slightly non-metric. We do not know whether
this is a significant or desirable property, or an artifact of the specific
dictionary examined.



3.2 One step memory 

For r = 1 walks, we have found that the probability $P_a(p)$ of a word to belong
to the basin of a p-cycle is very similar to that from walks in low dimensional
Euclidean spaces, as shown in Fig. 5. This result is striking and suggests that
the lexicon graph could be embedded in a low dimensional Euclidean space
preserving most of its neighbor order.

Notice that we have worked only with the $G_1$ (r = 0) and $G_2$ (r = 1)
subgraphs, that is, graphs containing only one or two nearest neighbors of each
point. So, we are examining the local structure of the network, not the large
distance links which give the Small World behavior or power law degree of
connectivity [9,10]. We have found that these subgraphs are very similar to those
generated by a random collection of points in Euclidean space with d = 2 (or
d = 3). It must be emphasized that higher values of d are excluded because
they would already be reflected in the $G_1$ and $G_2$ structure. The d = 1 case is
also excluded, since cycles with periods 5, 6 and 7 have been measured and are
forbidden by the model [6].

This result contrasts with standard models which represent words in high
dimensional (d > 100) Euclidean space, for instance, the Latent Semantic Analysis
(LSA) model [9,15]. If the present thesaurus were represented as points in such
high dimensional spaces, the tourist walk behavior should be similar to the
random link behavior. Instead, our findings suggest that, to best represent
its local behavior, the nodes (words) of the studied thesaurus should be
embedded in a low dimensional Euclidean space, with a small fraction of asymmetric
links (to produce cycles with p > 2 for r = 0). Of course, as indicated by other
studies [9,10], there are also long range links that give the Small World/Scale
Free character of the network.

We have repeated this kind of study with the Portuguese Microsoft Word 98
thesaurus. However, the existence of a large fraction of dead ends, i.e., nodes
without outgoing links (reflecting a low quality database), prevented further
analysis of this dictionary.



4 Conclusion 

We have shown that simple deterministic walks inside graphs with ordered
neighborhoods lead to the emergence of cycles with interesting statistical
properties. The spectrum of stable cycles $P_r(p)$ and the probability $P_a(p)$
of falling into a p-cycle from a random start give information complementary
to measures like the average minimal length L and the clustering coefficient C.
Networks with the same L and C could be distinguished by different spectra
$P_r(p)$ and $P_a(p)$.

Walks with no memory (r = 0) easily detect the absence of a metric in the graph.
The curves $P_r(p)$ and $P_a(p)$ given here define class behaviors for the least
informative spatial distribution of nodes (the Poisson case in Euclidean space)
that can be used as a benchmark for results from walks in other networks. Since
graphs with ranked neighborhood, but random structure, abound in the real
world (from neural and ecological to social networks [8]), it could be interesting
to find out whether deterministic walks on them belong, or not, to the same class
defined by the Euclidean random graphs studied here. A surprising result is that the
graph $G_2$ generated by a thesaurus is best represented by random points in
low dimensional, even d = 2, Euclidean space.



Figure captions:



Figure 1: Examples of emergent cycles for d = 2 and N = 400. The memory
values are r = 1, 3, 5 and 9. The same landscape has been used for each
plot, with open boundary conditions. The growth, destabilization and fusion of
cycles can be observed as a function of the memory r. Points not belonging
to a cycle are not displayed.

Figure 2: a) Distribution of cycle periods $P_r(p)$ for d = 2 and different values
of r (from left to right, r = 1, 2, 3, 4, 5, 6, 10). Notice that $P_r(p)$ is defined only
for integer p (as in the r = 1 curve); nevertheless, lines have been used for
better visualization. b) Data collapse using Eq. 7, $G(p/r) = r P_r(p)$ versus
p/r for r = 1, 2, 3, 4, 5, 6, 10 (from right to left).

Figure 3: a) Distribution of cycle periods $P_r(p)$, where the dotted line represents
the $P_r(p) \propto p^{-1}$ fit. b) Distribution of cycle periods $P_a(p)$. Curves are for
r = 1 and, from bottom to top, d = 2, 4, 8, 16, 32 and the symmetric
random link network. Other values are: N = 10^ points averaged over 10^ landscapes
with periodic boundary conditions.

Figure 4: Distribution of transient times for the MS98 USA English thesaurus
with r = 0 and N = 1000 initial words. Fitting the points with $P(t) = c\, e^{-bt}$
leads to the extrapolation P(0) = c = 0.618 (open circle), which can be
compared to $P_{rn}(2) = 0.6215$.

Figure 5: Comparison of the fraction of sites $P_a(p)$ belonging to the basin of a
p-cycle attractor for r = 1 in random Euclidean graphs with
d = 2 (circles), for N = 1000 initial words of the MS98 USA English thesaurus
(triangles), and in random link graphs (squares).